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Abstract 

We introduce a graph-theoretic approach to extract clusters and hierarchies in complex data-sets in an 
unsupervised and deterministic manner, without the use of any prior information. This is achieved by 
building topologically embedded networks containing the subset of most significant links and analyzing the 
network structure. For a planar embedding, this method provides both the intra-cluster hierarchy, which 
describes the way clusters are composed, and the inter-cluster hierarchy which describes how clusters 
gather together. We discuss performance, robustness and reliability of this method by first investigating 
several artificial data-sets, finding that it can outperform significantly other established approaches. Then 
we show that our method can successfully differentiate meaningful clusters and hierarchies in a variety 
of real data-sets. In particular, we find that the application to gene expression patterns of lymphoma 
samples uncovers biologically significant groups of genes which play key-roles in diagnosis, prognosis and 
treatment of some of the most relevant human lymphoid malignancies. 

Introduction 

Filtering information out of complex datasets is becoming a central issue and a crucial bottleneck in any 
scientific endeavor. Indeed, the continuous increase in the capability of automatic data acquisition and 
storage is providing an unprecedented potential for science. However, the ready accessibility of these 
technologies is posing new challenges concerning the necessity to reduce data-dimensionality by filtering 
out the most relevant and meaningful information with the aid of automated systems. In complex datasets 
information is often hidden by a large degree of redundancy and grouping the data into clusters of elements 
with similar features is essential in order to reduce complexity fll. However, many clustering methods 
require some a priori information and must be performed under expert supervision. The requirement of 
any prior information is a potential problem because often the filtering is one of the preliminary processing 
on the data and therefore it is performed at a stage where very little information about the system is 
available. Another difficulty may arise from the fact that, in some cases, the reduction of the system into 
a set of separated local communities may hide properties associated with the global organization. For 
instance, in complex systems, relevant features are typically both local and global and different levels 
of organization emerge at different scales in a way that is intrinsically not reducible. We are therefore 
facing the problem of catching simultaneously two complementary aspects: on one side there is the need 
to reduce the complexity and the dimensionality of the data by identifying clusters which are associated 
with local features; but, on the other side, there is a need of keeping the information about the emerging 
global organization that is responsible for cross-scale activity. It is therefore essential to detect clusters 
together with the different hierarchical gatherings above and below the cluster levels. In the literature 
there exist several methods which can be used to extract clusters and hierarchies [l}|3] and the application 
to biology and gene expression data has attracted a great attention in recent years pj-fT]. However, in 
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these established approaches, to extract discrete clusters, one must input some a priori information about 
their number or define a thresholding value. This introduces other potential difficulties because complex 
phenomena are often associated with multi-scaling signals which cannot be trivially thresholded. In 
this paper, we propose an alternative method that overcomes these limitations providing both clustering 
subdivision and hierarchical organization without the need of any prior information, without demanding 
supervision and without requiring thresholding. 

In recent years, several network based approaches have been proposed to describe complex data-sets 



and applied to several fields from biology [8j|9] to social and financial systems 10 11 . Indeed, networks 
naturally reflect in their set of vertices the variety of elements in the system, they reflect in their edges the 
plurality of the interrelations between elements and they encode in their dynamics the complex evolution 



and adaptation of the system 12 ■ 16 . In this paper we apply the network paradigm to the study of 
complex data-structures. In our approach a graph with constrained complexity is built by means of a 
deterministic construction inserting recursively the most relevant links. In this construction, complexity 
is constrained by embedding the graph on an hyperbolic surface of genus g (where the genus is the 
number of handles of the surface) [l7 18 . The Ringel- Youngs theorem ensures that for n vertices the 



complete graph, Kn, can be always embedded on a surface with large enough genus [g ~ 0{n^)) 19 
Any graph is a sub-graph of Kn and therefore any graph can be embedded on a surface. In this paper 
we are interested in the limit where graphs are sparse and they are embedded on simple surfaces. The 
simplest case is 5 = and the resulting graph is called Planar Maximally Filtered Graph (PMFG) and it 
is a triangulation of a topological sphere. Topologically embedded graphs on planar surfaces {g = 0) have 
a relatively small number of edges (0(n)) but they have high-clustering coefficients, they can display 
various kinds of degree distributions, from exponential to power-law tailed, and they can be used as 



a platform for modeling other systems [17j 21 ■ 24 . It has been shown that PMFG graphs are efficient 
flltering tools having topological properties associated to the properties of the underlying system [T8{[20) . 
This makes the PMFG a desirable tool to extract clusters and hierarchies from complex data-sets. 

The general idea at the basis of our method is to use the topological structure of PMFG graphs to 
investigate the properties of the data-sets. A detailed description of our clustering and linkage procedure 
is reported in the Methods section. For brevity, in the rest of the paper, we will refer to our clustering 
and linkage method as the DBHT technique. 



Results 

In this section, we apply the DBHT technique to various data sets ranging from artificial data with 
known clustering and hierarchical structures to real gene expression data. Comparisons are made be- 
tween the results retrieved by the DBHT technique and some of state-of-the-art cluster analysis tech- 



niques such as k-means-|--|- 25 , Spectral clustering via Normalized cut on k-nearest neighbor graph 



(kNN-Spectral) [26)|27|, Self Organizing Map (SOM) ^ and Q-cut [29]. Let us here stress that all 
these techniques -except DBHT- are non-deterministic and require some a priori information in order 
to setup the initial parameters. To compare with the DBHT technique, we run the other techniques for a 
broad range of parameters and pick the set of parameters that are best performing in average. This is an 
important negative bias against the DBHT technique that however, as we shall see shortly, still outper- 
forms consistently the state-of-the-art counterparts. We also tested the capability of DBHT technique to 
correctly detect the hierarchical organization by applying it to known synthetic datasets and comparing 
the results with the outcomes from average and complete linkage techniques. Furthermore, we explored 
the meaningfulness of the hierarchical gathering of clusters and the significance of their subdivision in 
sub-clusters by looking at the functional properties of these gatherings and splittings in real datasets. 
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Tests DBHT clustering on synthetic data 

We have evaluated performance of the clustering techniques by comparing their outcomes with the known 



artificial clustering structure by using a popular external validity index: the adjusted Rand index 30 
which returns 1 for a perfect match and in average for a random guess. Specifically, we have generated 
correlated data-series by using a multivariate Gaussian generator (MVG) |31 that produces N stochastic 
time series yi{t) of length T = 10 x TV with zero mean and Pearson's cross-correlation matrix R that ap- 
proximates an input correlation structure R* which is a bolck-diagonal matrix where the blocks represent 
the clusters and may have different sizes. The matrix R* has all ones on the diagonal, it has zero corre- 
lations outside the blocks (p°"* = 0) and it has a correlation value p"** inside the blocks. Furthermore, 
we added a number Nran of random correlations unrelated to the cluster structure. We have also gener- 
ated multivariate Log-Normal distributions by taking the exponential of MVG series generated by using 
reference correlation Ri^g which is devised to retrieve the correct approximation of R* with log-normal 
statistics [32j. To these correlated series we have added a noise rii{t) obtaining y'i{t) — yi{t) + c(7irii{t), 
where Ui is the standard deviation of yi{t) and c is a constant that can be used to tune the relative 
amplitude of noise. We have tested normally distributed {p{ri) oc exp(— 77^/2)), log-normally distributed 
{p{ri) oc exp(— log(77)^/2)) or power-law distributed {p{ri) oc I/t;""*"^) noises. We have used different val- 
ues for the relative amplitude of noise c and, in the case of power-law distributed noise, we have also 
varied the exponent a. By increasing the effect of noise and/or the number of random elements, the 
Pearson's cross-correlation matrix R passes from a very well defined structure similar to R* to a less 
defined structure where the difference between the average measured intra- and inter-cluster correlations 
in i?, (/O*") — becomes negligible. 
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Figure 1. Demonstration that the DBHT technique can outperform other state-of-the-art clustering 
techniques, namely: k-means-|--|- ^25, , Spectral clustering via Normalized cut on k- nearest neighbor 

Seff Organizing Map (SOM) [28], and Q-cut 
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29 . This figures reports the 



adjusted Rand index [30| for the comparison between the the 'true' partition embedded in the 
artificially generated data and the partition retrieved by the clustering methods. In these examples we 
have eight clusters of size 5 elements and one cluster of size 64 elements with p*"* = 0.9, = and 
Nran = 25. The plots report average values over a set of the 30 trials. The horizontal- axis reports the 
gap between average intra- and inter-cluster correlations dR — (p*") — (p°") that becomes smaller when 
the noise c increases, (a) Normally distributed correlated datasets with added Normal noise with c 
varying from to 4. (b) Log-Normally distributed correlated datasets with added power law noise with 
a = 1.5 and c varying from to 0.1. 



Figure [T] compares the performance of the DBHT technique with k-means++, SOM, kNN-Spectral 
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and Q-cut for correlated synthetic datasets consisting of 129 data series generated both with normal 
and log-normal statistics, with normal or power law noise with p*"* = 0.9, = and Nran = 25. 
This example refers to a rather extreme case where the clusters have highly dis-homogeneous sizes with 
one large cluster with 64 elements and eight clusters with 5 elements each. As one can see from Fig. 
[l] in this case the DBHT technique is strongly outperforming the other methods. In the supporting 
information, we report on a large number of cases where we demonstrate that consistently the DBHT 
technique is better, or at least equivalent, to the best performing counterparts for a very broad range 
of combinations of different kinds of artificial data. Let us here note that stochastic techniques such as 
k-meansH — h and SOM are particularly sensitive to noise distributions and tend to perform poorly with 
fat-tailed distributed noise. On the other hand, the Qcut technique carries an inherent resolution limit 
that over-shadows small clusters f33|. The DBHT technique instead is less affected by these factors and 
it consistently delivers good performances for across the range of parameters. 




XT 




Figure 2. Demonstration that the DBHT technique can detect clusters at different hierarchical levels 
outperforming other established linkage methods. The synthetic data are generated via multivariate 
Gaussian with added power law noise with exponent a — 1.5 and c — 0.1. (a) Input correlation R* for a 
synthetic data structure with nested hierarchical clustering with 4 'large' clusters, containing 8 
'medium' clusters, containing 16 'small' clusters, (b) Dendrogram associated with the DBHT 
hierarchical structure, (c) Dendrogram associated with the Average linkage, (d) Dendrogram 
associated with the Complete linkage. 
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Tests DBHT hierarchy on synthetic data 

We have tested the capabihty of the DBHT technique to detect hierarchies by simulating data with 
hierarchical structure such that smaller clusters are embedded inside larger clusters making a nested 
structure with different intra-cluster correlation. An example is shown in Fig[2][a) where we report an 
input correlation R* which is a nested block-diagonal matrix with zero inter-cluster correlation and with 
a structure of 4 'large' clusters (64 elements each) with intra-cluster correlation of p™* = 0.7. Each of 
the large clusters contains inside two 'medium' clusters (8 in total with 32 elements each) with p™* = 0.8 
that contain inside two 'small' clusters (16 in total with 16 elements each) with p™* — 0.95. We have 
simulated 30 different sets of data series of length T = 10 x by using MVG from R* with added 
power law noise with a = 1.5 and c = 0.1. We have tested the efhciency of the DBHT technique by 
moving through the hierarchical levels varying the number of clusters from only one at the top hierarchy 
to the number of elements at the lowest hierarchy. Figj2|^b) shows the dendrogram retrieved with the 
DBHT technique. By following the hierarchy from top to bottom, one can see that a structure with 4 
main clusters rapidly emerges and its partition coincides exactly with the 'true' partition in R* . Then 
these clusters correctly split into two parts each making 8 clusters in total scoring a value of 0.97 for 
the adjusted Rand index with respect to the 'true' partition at this level. Finally, these 8 clusters split 
again producing a partition that has an adjusted Rand index of 0.94 with respect to the 'true' partition 
at this level. The partition into discrete clusters identified by the DBHT is almost identical with this 
last one having 17 clusters instead of the 16 'true' clusters and achieving also an adjusted Rand index of 
0.94 (see supporting information). One can see from Fig[2](c,d) that, instead, the complete and average 
linkages give a less clear hierarchical structure. Several other examples are reported in the supporting 
information. The better performances of the DBHT technique over linkage methods can be explained by 
the fact that linkage techniques suffer from the greedy nature of the algorithm, where a misclassification 
of an element in an early stage of clustering can never be remedied (l]|3]. The rate of misclassification 
depends on the type of linkage distance, with the average linkage optimized for isotropic clusters, and 
complete linkage optimized for compact and well-defined clusters. On the other hand, DBHT hierarchy 
is based on a combination of linkage distance and topological constraints at multiple hierarchical levels: 
bubbles, clusters, bubble tree. This reduces the error rate with respect to the complete linkage distance. 




Figure 3. Comparison beween the clustering obtained via (a) DBHT technique, (b) best Qcut and (c) 
best kNN-Spectral on iris flower data set from Fisher [34]. The labels inside the symbols correspond to 
the three different types of flowers: (s) Iris Setosa; (v) Iris Versicolour; (g) Iris Virginica. The shapes of 
the symbols correspond to the clusters retrieved by the different clustering techniques. 

Application of DBHT technique to Fisher's Iris Data 

One of the typical benchmark referred in clustering analysis literature is the iris flower data set from 
Fisher fS?. Briefly, the data set contains the measure of four features (i) sepal length; (ii) sepal width; 
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(iii) petal length; (iv) petal width, for 50 iris plants from three different types of iris, namely (1) Iris 
Setosa; (2) Iris Versicolour; (3) Iris Virginica. The data set is available from UCI Machine Learning 
Repository website ^35^. It is known that, the clustering structure of the data set linearly separates one 
type of Iris from the other two. The remaining two types are instead not linearly separable and their 



subdivision is a classical challenge for any clustering technique 35 . Here, in order to compute clustering 
and hierarchies we have used the pair-wise Euclidean distance Deuc(*,j) = — Xj\\ as dissimilarity 

matrix and Reuc(i, j) = cxp (— ) as similarity matrix 27 , where a is the standard deviation of 
Deuc(*i i) foi' all pairs of (i, j). From these measures, we directly computed clustering and hierarchies via 
DBHT technique obtaining the graph structure shown in Fig|3][a) where one can see that all the three 
iris types are rather well separated occupying different parts of the graph. By extracting three clusters 
form the DBHT hierarchy we observe that the first flower type (Iris Setosa) is fully separated and the 
other two are rather well divided with only a few misplacements. The DBHT results are compared with 
other two graph-based techniques, Qcut and kNN-Spectral techniques computed using Reuc fo^ a range 
of kNN — 2, . . . , {N — 1). These methods ar non deterministic and we retained only the best partitions 
which give the highest adjusted Rand score which are shown in Fig[3]^b,c). We can observe that Qcut 
and kNN-Spectral techniques provide a poorer separation of the last two flower tipes (Iris Versicolour 
and Iris Virginica). This is quantified by the adjusted Rand index computed by comparing with the true 
partition that gives 0.89 for DBHT and 0.85 for both Qcut and kNN-Spectral. Indeed, these last two 
techniques both misplace 8 elements of the two groups whereas DBHT misplaces only six. Other two 
clustering techniques, k-meansH — h and SOM, have been run over 30 iterations with a input number of 
clusters fc = 3, yielding to poorer partitions with the largest adjusted Rand indexes respectively of 0.73 
and 0.80 which are well below the score achieved by the DBHT technique. 

Application of DBHT technique to gene expression data set from human can- 
cer samples 

We have applied the DBHT technique to analyze gene expression data sets collected by Alizadeh et 
al [36| concerning 96 malignant and normal lymphocyte samples belonging to the three most relevant 
adult lymphoid malignancies, namely: Diffuse Large B-Cell Lymphoma (DLBCL); Follicular Lymphoma 
(FL); Chronic Lymphocytic leukemia (CLL); together with other 13 kinds of samples from normal human 
tonsil, lymph node. Transformed Cell Line, Germinal Centre B, Activated Blood B, and Resting Blood 
B. This data set has already served as a benchmark to evaluate performance of clustering techniques 



on gene expression data 29 , 37 and this is why we have chosen to test our method on this referential 
dataset. Patients with DLBCL cancer type have variable clinical courses and different survival rates and 
there are strong indications that DLBCL classification includes more than one disease entity [36:. The 
challenge for a clustering algorithm is therefore to analyze the DLBCL genetic profiles and individuate 
different subtypes of DLBCL to be associated with different clinical courses. Indeed, various studies 
have attempted to highlight genetically significant genes that can be of clinical significance to improve 



the DLBCL patients' diagnosis and clinical treatment 36 38-43 . In particular, it is understood that 
DLBCL is a very heterogeneous type of Lymphoma and there are at least three distinct subtypes which 
differ in treatment methods for improved survival of the patients f36''38''44'. 

We have first applied the DBHT technique on the gene expression data by using Pearson's correlation 
as similarity measure, and correlation distance as the dissimilarity measure. The DBHT clustering yielded 
to 11 sample-clusters, which are shown in Fig. |4j One can immediately note that all FL samples are 
gathered together in one cluster that also contains the DLCL-0009 sample which it has also been associated 



to FL in other studies on the same data 29p6) . Transformation of FL to DLBCL is common 45 1, and this 



cluster suggests that DLCL-0009 may have derived from FL, sharing therefore common gene expression 
patterns. We also observe in Fig. |4] that all, except one, the CLL samples occupy a single cluster. The 
missing CLL sample is attached to this cluster and it is included in a cluster containing Resting Blood 
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Figure 4. Sample-cluster structure for 96 malignant and normal lymphocyte samples from Alizadeh et 
al 2000 [36], the labels inside the symbols correspond to the different sample types as listed in the 
legend. The DBHT technique retrieves 11 sample-clusters here represented with different symbols (see 
legend). The underlying network is the PMFG from which the clustering has been computed. 



B samples which have indeed similar expressions patterns and clinical similarity to CLL and are often 
merged together by other clustering techniques [29| . DLBCL cancer types appear in four different sample- 
clusters which are however lying together in a branch of the PMFG graph. Significantly, these clusters 
also include some other GCB-like samples. Remarkably, if we look at the patient survival rates (Table [T]), 
we see that these four sample-clusters are extracting DLBCL cancer subtypes with very different clinical 
courses. Indeed, if we consider separately the patients with DLBCL type of Lymphoma accordingly with 
the subdivision into the four sample-clusters '1', '5', '7' and '9' (from bottom to top of the Fig.|4]), they 
respectively have survival rates 100%, 56%, 15% and 29% (see Table[l]for details). In the work of Alizadeh 
et al |36| survival rate differentiation in DLBCL patients was associated with two main cancer subtypes, 
namely GCB-like and ABC-like, with the latter considered more fatal that the former. We can note that, 
in our clustering, sample-cluster '1' contains GCB-like DLBCL, and it also includes other GCB samples 
such as tonsil GCB, tonsil GC fibroblast, and high survival rates are common in GCB-like cancer types 
(see Fig. 10 in supporting information). Cluster '5' is also characterized by GCB-like DLBCL samples, 
however its proximity to ABC- like clusters (see Fig. 10 in supporting information), may be the clue to 
relatively low survival rate in comparison to cluster '1'. Cluster '9' is characterized by a majority of 



ABC-like DLBCL to which we may attribute its relatively low survival rate 36 . On the other hand, 
cluster '7', which shows a surprisingly low survival rate, has instead a significant number of GCB-like 
DLBCL samples, this might signal the existence of another relevant DLBCL subtype. 

In order to functionally validate these sample-clusters, we have analyzed the expression profiles for 
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Sample Cluster '7' 


Sample Cluster '9' 


Cluster Size 
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20 


# of DLBCL 


4 


9 
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17 


jj= Survived over Syr 


3 (100%) 


S (S6%) 


1 (14%) 


S (29%) 


jj= Died in Syr 





4 


6 


12 



Table 1. Survival rates of cancer patients with DLBCL type of Lymphoma. The patients are divided 
in four groups corresponding to the four sample-clusters containing the DLBCL samples (see Fig. l4|). 
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Table 2. Number of up-regulated (on the left) and / down-regulated (on the right) expression profiles 
for each group of clones with known physiological roles as reported in Ref. ^36j . The sample-cluster 
labels are as in Fig. |4j Some significant up-/down-regulation patterns, commented in the text, are 
highlighted by boldface font. 



6 groups of genetic clones with known physiological roles, namely: GCB- Germinal Center B cell (111 
clones), LyN- Lymph Node (136 clones), PBC- Pan B Cell (81 clones), Pr- ProHferation (312 clones), TC- 
T Cell (111 clones) and ABC- Activated B Cell (86 clones) [36]. The significance of regulation patterns 
has been evaluated by one-tailed T tests with cut-off p-value of 0.01. The number of up-/down-regulated 
profiles for each group of clones is shown in Table [2j Significant up- / down- regulation patterns of the 
expression profiles in the sample-clusters reflect the biological relevance the group of gene-clones in each 
sample-cluster. We first observe that sample-clusters containing DLBCL cancer types (e.g. cluster '1', '5', 
'7' and '9') distinguish from other samples by up-regulating more clones from Pr, hence reflecting higher 
proliferative index. Sample clusters associated to DLBCL are also differentiating among themselves, for 
instance, sample-clusters '1' and '5' both up-regulate GCB clones but they differ significantly in the 
up-regulation of LyN clones, supporting the subdivision of GCB-like DLBCL by these sample clusters. 
Similarly, sample-cluster '7' shows a unique expression signature that highlights a strong up-regulation 
of LyN clones in comparison to other clones. Given that this sample-cluster is a mixture of ABC-like and 
GCB-like DLBCLs, and it shows distinctively low survival rate, this again suggests that sample-cluster '7' 
is a different subtype of DLBCL outside of GCB-/ABC-like classification. Overall, these results indicate 
that DBHT clustering technique is able to reveal a meaningful classification of biologically significant 
DLBCL subtypes which is richer than what proposed in the original study by Alizadeh et al |36| . 

Let us now move a step further and use the DBHT technique to identify significant groups of genes 
that are of relevance for particular cancer samples. Indeed, an accurate identification of significant genes 
is crucial in treating the tumor cells as there are a large number of different genetic mechanisms from 



which these tumor cells originate, hence they require different treatments 46 47 . We have therefore 
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Figure 5. Left: Heat map of gene expression profiles for the six significant clusters of genes. Each row 
represents the expression profile from a clone, and each column represents a sample. The samples are 
organized according to the DBHT hierarchy as shown on the dendrogram on the top. Significant 
gene-clusters are highlighted with different colors as follows (from top to bottom, colours online): Red - 
gene-cluster '44' (significant for sample-cluster '1'); Green - gene-cluster '109' (significant for 
sample-cluster '4'); Blue - gene-cluster '1' (significant for sample-cluster '5'); Black - gene-cluster '4' 
(significant for sample-cluster '7'); Magenta - gene-cluster '125' (significant sample-cluster '9'); Yellow - 
gene-cluster '102' (significant for sample-cluster '11'). The same color scheme is used on the bottom of 
the heatmap to denote the corresponding sample-clusters. Right: Mean expression profile for each 
gene-cluster together with the expression profiles of note-worthy gene for each sample-cluster. The 
x-axes report the gene clusters and the boundaries of the relevant sample-cluster for each gene-cluster 
are indicated with the vertical dashed lines. 
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performed a two-way clustering: on the samples and genes simultaneously. In this way, we can cross- 
tabulate the samples against genes obtaining a simple and effective picture of significant gene expression 
patterns. Let us note that with conventional clustering techniques, the two-way clustering adds another 
dimension of complexity. Indeed, samples and gene expression profiles have different dimensions and 
scales and therefore it is necessary to tune the clustering parameters separately for each clustering way. 
On the other hand, the DBHT technique has no adjustable parameters and it is deterministic providing 
therefore a unique cross classification without any increase in complexity. The DBHT technique identifies 
180 gene-clusters from which we have extracted 6 clusters which are significantly differentiating for sample- 
clusters associated to FL, CLL and DLBCL, accordingly with a p- value threshold of 0.01 with Bonferroni 
correction. The expression profiles of these significant gene-clusters are reported in Fig. [5j We have then 
validated functional significance of these gene-clusters by performing a gene-ontology (GO) analysis to 
identify significant GO terms for biological processes [48j. (See supporting information for the statistical 
analysis methods and GO results.) Let us here report on some relevant genes, from each of the 6 significant 
gene-clusters, selected by choosing the most frequently appearing genes in the GO terms. Interestingly, 
these genes reveal some of biologically significant mechanisms that regulate growth of tumor cells, and 
that affect survival of respective lymphoma malignancy. In particular: 

• Gene cluster '44' (significant for sample-cluster '1'): This gene-cluster is up- regulated for sample- 
cluster '1' in comparison to the expressions in other sample-clusters associated to lymphoma. Sig- 
nificantly, one of its key genes is CDKl, which is a key player in cell cycle. It has been indicated 
that over-expression of CDKl is common in DLBCL cancer types, and it is therefore a potential 



therapeutic target 49 



Gene cluster '4' (significant for sample-cluster '4'): This gene-cluster particularly expresses for 
sample-cluster '4', which consists mostly of FL samples. Among the genes in this gene-cluster there 
is SYK which -indeed- has been indicated as a promising target gene for antitumor therapy for 
treating FL, where inhibition of SYK expression increases the chance of survival '50]. 

Gene cluster T (significant for sample-cluster '5'): Gene cluster 1 is particularly down-regulated 
for sample-cluster '5'. This gene-cluster contains TGF-Bl which is a well-known transcription 
factor to regulate proliferation, in particular a negative regulator of B-cell lymphoma which induces 



apoptosis of the tumor cells via NF-wB/Rel activity 51 . This suggests that suppression of the 
tumor cells by TGF-Bl would be lessened in sample-cluster '5' due to the down-regulation, and 
this may contribute to the decreased chance of survival observed in sample-cluster '5' in comparison 
to that of sample-cluster T. 

Gene cluster '4' (significant for sample-cluster '7'): This gene-cluster is slightly down-regulated 
for sample-cluster 7', and GO analysis extracts two genes, CDKNlB/p27^*Pi and CDKN2D/pl9, 
which are key tumor suppressor genes for aggresive neoplasms |52[|53| . The inhibited tumor sup- 
pressive role of these genes might have led to aggressive growth of tumor cells suggesting a plausible 
explanation for the poorest survival rate, observed for sample-cluster '7', with respect to the other 
DLBCL sample-clusters (see Tablejl]). Indeed, it has been suggested that p27 is associated to lym- 
phomagenesis through Skp2 [53| and Skp2 has been indicated as an independent marker to predict 
survival outcome in DLBCL [5 3 [[54] . 

Gene cluster '125' (significant for sample-cluster '9'): This gene-cluster shows distinct up-regulation 
pattern for sample cluster '9', and it includes an interesting gene 'IL-6'. IL-6 is known to be a central 
target gene in a synergistic crosstalk between NF-kB and JAK/STAT pathway, which is a unique 



feature for some DLBCL 47 . It is suggested that, these have implications for targeted therapies 



by blocking STAT3 expression, a gene that is activated by IL-6 |47[[55| 
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• Gene cluster '102' (significant for sample-cluster '11'): This gene-cluster particularly down-regulates 
the CLL sampe-cluster among all lymphoma-related clusters. Though it does not indicate a partic- 
ularly significant GO term (see Table 2 in the supporting material), it includes a number of genes 
related to regulating tumor cell growth for CLL (See Table 3 in supporting material for the list 
of genes). Among these genes, let us note IRFl, which is a well-known mediator for cell fate by 



facilitating apoptosis, and it is also a tumor suppressor 56 . As the expression of IRFl is slightly 



down-regulated, we suspect that this may contribute to the growth of CLL tumor cells. 

In conclusion let us stress that these results strongly indicate that the DBHT technique can detect 
relevant differentiation and aggregations in both cancer-samples and gene-clones revealing important 
relations that can be used for diagnosis, for prognosis and for treatment of these human cancers. 



Discussion 

In summary, we have introduced a novel approach, the DBHT technique, to extract cluster structure 
and to detect hierarchical organization in complex data-sets. This approach is based on the study of the 
properties of topologically embedded graphs built from a similarity measure. The DBHT technique is 
deterministic, it requires no a-priori parameters and it does not need any expert supervision. We have 
shown that the DBHT technique can successfully retrieve the clustering and hierarchical structure both 
from artificial data-sets and from different kinds of real data-sets outperforming in several cases other 



established methods. The application of the DBHT technique to a referential gene-expression dataset 36 
shows that this method can be successfully used in differentiating patients with different cancer subtypes 
from gene-expression data. In particular, we have correctly retrieved the differentiation into distinct 
clusters associated with cancer subtypes (PL, CL and DLBCL) along with a meaningful hierarchical 
structure. The DBHT technique provides a meaningful differentiation of the DLBCL cancer samples 
into four distinct clusters which turn out to correspond to different survival rates. The application of 
the DHBT clustering technique over the gene-clones identifies new groups of genes that play a relevant 
role in the differentiation of the cancer subtypes, and possibly in relevant genetic pathways which control 
survival/proliferation of the tumor cells. Differently from 36| which indicates GCB- and ABC-like DLBCL 
classification under thorough supervision with biological expertise, we have found instead, in a completely 
un-supervised manner, four subtypes of DLBCL with different expression signatures that differentiate 
significantly in their genetic mechanisms and biological features resulting in well distinct survival rates, 
hence providing a new perspective. It should be stressed that the DBHT technique is addressing the 
problem of data clustering and hierarchical study from a different perspective with respect to other 
approaches commonly used in the literature. It therefore provides an important alternative support in a 
field where the sensitivity of the results to the kind of approach is often crucial. The DBHT technique 
can be extended to more complex measures of dependency which may be also asymmetric. In our graph 
theoretic approach this can be handled by constructing topologically embedded directed graphs. Another 
extension may concern the use of graph-embedding on surfaces of genus larger than zero that will provide 



more complex networks and a richer data filtering 17 
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Figure 6. A schematic overview of the construction of the bubble tree, (i) An example of PMFG 

graph made of nine vertices V{G) — {wi, W2, W3, f4, ^5, wg, wy, wg, wg} and containing three separating 

3-cliques: fci, k2 and ^3. (ii) The separating 3-cliques have vertex sets: V{ki) = {v2,V3,V4}, 

V{k2) = {v2,V4,v^}, and V{k3) — {v3,V4,vq}. (iii) The separating 3-cliques identify four planar 

sub-graphs called "bubbles": 61, 62, ^3 and bi with vertex sets V(bi) = {wi, ^2, ^3, U4}, 

V{b2) = {v2,V3,V4,V5,VG,vg}, V (ts) = {v2, V4, V5 , vj} and 1^(64) = {v3,V4,V6,vs}- (iv) The graph can be 

viewed as a "bubble tree" made of four bubbles connected through three separating 3-cliques. 

Methods 



The PMFG is a weighted graph where edges uv have weights Wu^v which, in general, are similarity mea- 
sures (a larger weight Wu^v of edge uv corresponds to a stronger similarity between u and v). Furthermore, 
a distance du,v, or more generically, a non-negative dissimilarity measure is also associated to the edges. 
Specifically, the PMFG is a graph G{V, E, W, D) where V is the vertex set, E the edge set, W the edge- 
weight set and D the edge-distance set. A hierarchy in G can be built from a simple consequence of 
planarity which imposes that any cycle (a closed simple path with the same starting and ending vertex) 
must be either separating or non-separating 57 . If we detach from the graph the vertices belonging to 
a separating cycle then two disjoint and non-empty subgraphs are produced. The simplest cycle is the 
3-clique which is a key structural element in PMFGs. An example of PMFG is shown in Figj6] where 
the separating 3-cliques are highlighted. By definition, each separating 3-clique, fcp, divides the graph G 
into two disconnected parts, the interior G™ and the exterior G^^, that are joined by the clique itself. 
The union of one of these two parts and the separating clique is also a maximally planar graph. Such a 
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Figure 7. Illustration of the DBHT technique, (i) Construction of the directed bubble tree where 
directions are given to the 3-cliques /ci, ^2 and ky, (from Figjoj) accordingly with the largest weight W^^ 
and (Eqjljin Methods section). In this example we have two converging bubbles: ba — bi and 

6^ = 64. A unique set of vertices can be associated to each of the two converging bubbles ba and &^ 
where vertices shared by both the converging bubbles (i.e. the vertices V3 and V4) are assigned 
accordingly with the largest strength x (Eqj2] in Methods section) . (ii) All the other non-assigned 
vertices (i.e. v^, vg and uy) are associated to the cluster with minimum average shortest path length L 
(Eqjs] in Methods section) . (iii) The vertex set is uniquely divided into two clusters respectively 
associated to the two converging bubbles: V{a) — {wi, f2, ^3, W5} and V{l3) = {^4, Wg, W7, ''^g}- (iv) 
The hierarchical organization and the clustering structure can be represented with a dendrogram. 



presence of cliques within cliques provides naturally a hierarchy. The subdivision process can be carried 
on until all separating 3-cliques in G have been considered. The result is a set of planar graphs, that we 
call "bubbles" , which are connected to each other via separating 3-cliques, forming a tree |58j. In Figj6](a) 
the "bubble tree" and its construction are shown. In the bubble tree (Hf,) vertices bi represent bubbles 
and edges bibj represent the separating 3-clique, kp, which is connecting the two bubbles. A direction 
can be associated to each edge in Hj, by comparing the sum over the weights of the edges in the PMFG 
connecting the 3-clique kp with the two bubbles. Specifically, a direction can be associated to the edge 
bibj by comparing the connections of kp with the interior sub-graph G™ and the exterior sub-graph Gp^ 
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and considering the two weights 



Wl"^^^^ V Ag{v,u) (1) 



p /L^ 



where Ag{v,u) — Wy^ is the adjacency matrix of G. The direction is given toward the side with largest 
weight obtaining Hi,. (In the case of equal weights in the two directions, the two bubbles are joined 
into a single larger bubble.) In Hi, there are three different kinds of bubbles; (1) converging bubbles 
where the connected edges are all incoming to the bubble; (2) diverging bubbles where the connected 
edges are all outgoing from the bubble; (3) passage bubbles where there are both inwards and outwards 
connected edges. An example is provided in Fig[7| where we have two converging bubbles (5i and 64), 
one diverging bubble (63) and one passage bubble (62)- Converging bubbles are special being the end 
points of a directional path that follows the strongest connections and we consider them as the centers of 
clusters. Any bubble bi connected by a directed path in Hi, to a converging bubble ba belongs to cluster 
a. By construction, bubbles in cluster a form a subtree ha which has only one converging bubble ba 
and all edges are directed toward b^- This is a non-discrete clustering of bubbles because there can be 
multiple directed paths between bi and two or more converging bubbles ba, &/3,... . In Fig[7|ii) the two 
subtrees converging toward ba = bi and bfj = 64 are highlighted, it is clear that in this example bubbles 
62 and 63 are shared by the two subtrees. A non-discrete clustering of the vertex set V{G) can now be 
obtained by assigning to each vertex v the cluster memberships of the bubbles that contain it. In order 
to obtain a discrete clustering for V{G) we uniquely assign each vertex to the converging bubble which is 
at the smallest shortest path distance (see FigjTjfor a schematic overview). This is achieved in two steps. 
First, we consider the vertices in the converging bubbles. Some vertices belong to only one converging 
bubble and, in this case, they are assigned to it (e.g. in Fig [t] vertices vi and V2 are assigned to ba — 61 
and vertices vq, vg, are assigned to bjs = 64). Other vertices instead belong to more than one converging 
bubble (e.g. vertices and U4 in FigjT]) and in this case we look at the 'strength' of attachment 

H\V{ba)\-2) ' 

and assign each vertex to the bubble with largest strength. (The notation |l^(&a)| in Eqj2] indicates the 
number of vertices in the vertex set of ba and 3(|F(6ct)| — 2) is the number of edges in a bubble.) After 
this assignment, each converging bubble a has a unique set of vertices ^'^(0;). (There can be converging 
bubbles with an empty set of vertices and, in this case, there will be no clusters associated to them.) 
Second, we consider all the other remaining vertices (e.g. vertices W5, W7 and vg in Figj?]). A vertex v may 
belong to more than one subtree ha, hp... and, in this case, it is assigned to the converging bubble that 
has the minimum mean average shortest path distance 

L{v, a) = mean{l{v, u)\u e !/"(«) A w € v(ht)} (3) 

with respect to all other converging bubbles. Here l(v,u) is the shortest path distance on G from v to 
u (the smallest sum of distances dr,s over any path between v and u). We have now obtained a discrete 
partition of the vertex set V{G) into a number of sub-sets V{a), V{I3),... each respectively associated to 
the converging bubbles ba, bp,... . 

Once a unique partition of the vertex set into discrete clusters has been obtained, we can investigate 
how each of these clusters is internally structured and how different clusters gather together into larger 
aggregate structures. This can be achieved by building a specifically tailored linkage procedure that 
builds the hierarchy at three levels. 

1. Intra-bubble hierarchy: we must first assign each vertex v € V{a) to a bubble bi in the subtree ha. 
Vertices in the converging bubbles have been already assigned to the sets V^{a). For all remaining 
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vertices, the ones belonging to only one bubble are assigned to such bubble (e.g. vertices and vg 
in Figj?]). Whereas, vertices that are belonging to more than one bubble (e.g. vertex U5 in Figj?]) 
are assigned to the bubble that maximizes the strength bi) (Eq[2]). In this way for every cluster 
a and for each bubble bi in ha we have a unique vertex set V°'{bi) on which we can now perform a 
complete linkage procedure [59] by using the shortest path distances l{u,v) as distance matrix. 

2. Intra-cluster hierarchy: we perform a complete linkage procedure between the bubbles in h^ by 
using the distance matrix 

4(6„6j) = max{/(w,i;)|ueF"(6,)Ai;el^"(fej)} . (4) 

3. Inter-cluster hierarchy: we perform a complete linkage procedure between the clusters by using the 
distance matrix 

d"(a,;9) =max{/(u,w)|Me y(a)At;e y(/3)} . (5) 



With this procedure we obtain a novel linkage that starts from the discrete clusters and at higher level 
joins the clusters into super-clusters and, instead, at lower level splits the clusters into a hierarchy of 
bubbles and splits the bubbles into a hierarchy of elements. 
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Supplementary Information 

A Artificial data with a clustering structure 
A.l Preparation 

By using a multivariate Gaussian generator (MVG) and a multivariate Log-Normal generator[see Wang 
SS (2004) Casualty actuarial society proc. LXXXV] we have produced several synthetic time series which 
approximate a given correlation structure R* . Specifically, we have generated N stochastic time series 
yi{t) of length T {i — 1...N, t — 1...T) with zero mean and Pearson's cross-correlation matrix R that 
approximates R* . As for the starting correlation structures R* , we have used block diagonal matrices 
where the blocks are the artificial correlated clusters. The matrix R* has zero inter-cluster correlations 
pou* g^^^ large intra-cluster correlations p™* within the diagonal blocks. To this pre-defined cluster 
structure, we added a number Nran of random correlations unrelated to the clusters. We have chosen 
T — 10 X N and we added a noise term r]i{t) obtaining a new set of dataseries 



where ai = \J (yf) — (yt) is the standard deviation of ytit) and c is a constant used to tune the relative 
amplitude of noise. We have used a Normally distributed noise with probability distribution function 
p{ri) oc exp(— 77^/2) and a log-Normally distributed noise with probability distribution function ^(77) esc 
exp(— log(77)^/2). We have varied the relative amplitude of noise c from to 7 with constant intra- 
cluster correlation in R* at p™* = 0.9. We also have used power-law distributed noise, with probability 
distribution function p{ri) cx 1/77"+^. Specifically, this noise was numerically generated by using r](t) = 
±|77""(t)|(-i/"), where 77""(t) is a uniformly distributed noise in (0, 1] and the sign in front is chosen at 
random for each t with probability 50%. In this case, we have varied the relative amplitude of noise c 
from to 0.8 with exponent a = 1.5 and constant intra-cluster correlation p*"* = 0.9. We also have varied 
the exponent a between 1 to 3 keeping c = 0.1 and p'"* = 0.9. Examples of the obtained correlation 
matrices are reported in Figj8]for the MVG and Fig. [9] Log-normal multivariate generator. 

All these different manipulations produce a similar effect where by increasing the amplitude of noise 
or by decreasing the exponent or by reducing p™* , the Pearson's cross-correlation matrix R passes from 
a very well defined structure close to R* to a blurred structure where the average intra-cluster correlation 
((p™)) becomes smaller and finally it becomes equal to the average inter-cluster correlation ((p°")) and 
no correlation structure can be any longer observed. 

In summary, the simulated data were generated by combining the following possibilities. 

• Partitions: 

— Regular Partitions (all clusters of the same size), 

— Irregular partitions (clusters with different sizes). 

• Type of multivariate random variables: 

— Multivariate Gaussian Distribution; 

— Multivariate Log-normal Distribution. 

• Type of perturbation noises: 

— Univariate Gaussian Distribution; 

— Univariate Log-normal Distribution; 

— Univariate Power-law Distribution. 



(6) 
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(a) Gaussian Noise, c = 
0.9, = 0, JVra„ = 5 



0.2, p' 



(d) Log-normal Noise, c = 0.2, 

0.9, p™* = 0, TVran = 5 




(b) Gaussian Noise, c : 
0.9, /3°"* = 0,Nran = 5 



1.8, 




(e) Log-normal Noise, c = 0.8, p^ 
0.9, p™' = 0,JVra„ = 5 




(c) Gaussian Noise, c = 
0.9, /o°"* = 0,JVr„„ = 5 



4.2, p" 




(f) Log-normal Noise, c = 2, p^"* 
0.9, /o™* = 0,JVra„ = 5 




(g) Power-law Noise, c = 0.05, p^^ 

0.9, /)""* = 0, Q = 1.5, Nrar, = 5 



(h) Power-law Noise, c = 0.25, p^' 
0.9, = 0, a = 1.5, TVran = 5 



(i) Power-law Noise, c = 0.65, p*' 
0.9, = 0, Q = 1.5, Nj-an = 5 



Figure 8. Visualization of correlation matrices of synthetic data sets generated from MVG with 
partition of cluster sizes 4,8,16,32 and 64 where relative noise amplitude c has been varied to change the 
resolution of clustering structure. The parameters are specified underneath each figure. The first row 
adjusts c for Gaussian noises, the second adjusts for log-normal noises, and the third adjusts for 
Power-law noises with a = 1.5. 
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0.9, p""* = 0, JVran = 5 0.9, = 0, 7V™n = 5 0.9, f>°"' = 0, 7Vr„„ = 5 




(d) Log-normal Noise, c — 0.2, = (e) Log-normal Noise, c = 0.8, * = (f ) Log-normal Noise, c = 2.8, = 

0.9,^°"* = 0,Af„n = 5 0.9, = 0, JVran = 5 0.9, p"^* = 0, Nran = 5 




(g) Power-law Noise, c = 0.05, p^"* = (h) Power-law Noise, c = 0.25, p^^* = (i) Power-law Noise, c = 0.65, p**^* = 
0.9, p""* = 0,Q = 1.5,JVra„ = 5 0.9, p™* = 0,a = 1.5,Nran = 5 0.9, p""* = 0,a = 1.5,Afran = 5 



Figure 9. Visualization of correlation matrices of synthetic data sets generated from log-normal 
multivariates with partition of cluster sizes 4,8,16,32 and 64 where relative noise amplitude c has been 
varied to change the resolution of clustering structure. The parameters are specified underneath each 
figure. The first row adjusts c for Gaussian noises, the second adjusts for log-normal noises, and the 
third adjusts for Power-law noises with a = 1.5. 
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• Relative noise amplitude c. 

• Random background elements N^an- 

A. 2 Comparison with different clustering methods 

Fig. [To] shows the performance curves evaluated via adjusted Rand index for simulated data with multi- 
variate Gaussian distribution, and Fig. [TT] shows the performance curves for simulated data with multi- 
variate Log-normal distribution. The results for a wide range of dR > 0.1 for a broad set of combinations 
show that DBHT clustering outperforms the other clustering techniques except for Qcut which performs 



similarly to the DBHT. However, Fig. 12 shows that the DBHT clustering can outperform also Qcut for 
both Gaussian and Log- normally simulated data when an extreme cluster size differentiation is present. 
Specifically, in Fig. [12] there is a structure of eight small clusters of size 5 elements and one big cluster 
of 64 elements, and large number of random background elements [Nran — 25). Let us stress that the 



performance curves in Fig. 12 demonstrate that DBHT clustering is the only technique which delivers 



consistent and quality clustering outcomes in spite of the severe conditions applied. 



B Artificial data with a hierarchical Structure 
B.l Preparation 

In order to test the DBHT technique for the detection of the hierarchical structure, we have generated 
input matrices R* that are organized in a nested block-diagonal structure where block of small sizes are 
placed inside blocks of lager sizes. In particular, we looked at regular partitions of 16 'small' clusters 
containing 16 elements each with p™* — 0.95. These small clusters are merged to 'medium' clusters with 
pin* „ Q g^j^j further merged to 'big' clusters with p™* = 0.7. Finally, all clusters are merged to a 
single cluster with = 0.15. Similarly, we looked at irregular partitions with clusters of scaling sizes 
containing, 4, 4, 8, 8, 16, 16, 32, 32, 64 and 64 elements each, and the structures of small, medium, and 
big clusters were embedded by consecutively merging with Pi"*, p™*: P™* ^^'^ 

B.2 Comparison with different linkage methods 

We have simulated 30 different sets of multivariate Gaussian data series of length T = 10 x by using 
nested hierarchical block-diadonal input malices R* . An example of R* is provided in Fig[T3]^a) (same as 
Fig. 2 (a) in the paper). We have tested the capability of the DBHT method to recognize hierarchies by 
moving through the different hierarchical levels varying the number of clusters from only one at the top 
hierarchy to the number of elements at the lowest hierarchy. At each number of clusters we have measured 
the adjusted Rand index with respect to the 'large', 'medium' and 'small' partitions. Figs[l4|^b-d) show 
the average adjusted Rand index and the standard deviations over the 30 sets of synthetic data obtained 
by using the DBHT method, the average linkage method and the complete linkage method. One can 
observe in Fig[l4)^b) that all three methods successfully detect the 4 large clusters retrieving adjusted 
Rand index near to unity. At following hierarchical levels only the DBHT method consistently retrieves 
the maximum value for the adjusted Rand index respectively at the hierarchical partitions with 8 and 16 
clusters. Conversely, the other two methods achieve lower maximal values of the adjusted Rand index at 
a larger number of clusters inconsistent with the sizes of the synthetic data structure. We have tested 
other partitions and different levels of noise verifying that the DBHT method is consistently delivering 
good performances in comparison with the other established methods. An example, by using power law 
noise and clusters of scaling sizes respectively of 4, 8, 16, 32 and 64 elements is reported in Fig[T5)(a). The 
dendrograms for the DBHT, and the average linkage and the complete linkage methods are respectively 
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(c) Gaussian data with lognormal noise and (d) Gaussian data with lognormal noise 
regular Partition and irregular Partition 




0.4 0.6 

<dR> 



(e) Gaussian data with power-law noise (f ) Gaussian data with power-law noise and 
and regular Partition irregular Partition 



Figure 10. Adjusted Rand index for various data sets simulated via Gaussian (Normal) distribution 
with p™* = 0.9, — and Nran — 5. For each value of C, 30 data sets were generated in order to get 
stable statistics for < dR > and adjusted Rand score. 



reported in Figs 15 b,c,d). The comparison between the adjusted Rand indexes is reported in Fig 16 
One can see that, also in this case, the DBHT technique consistently outperforms the linkage methods. 
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(a) Lornogmal data with Gaussian noise (b) Lognormal data with Gaussian noise 
and regular Partition and irregular Partition 
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(c) Lognormal data with lognormal noise (d) Lognormal data with lognormal noise 
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(e) Lognormal data with power-law noise (f) Lognormal data with power-law noise 
and regular Partition and irregular Partition 



Figure 11. Adjusted Rand index for various data sets simulated via Log-normal distribution with 
pin* _ Q_g^ pou* _ Q j^j-^j Nran = 5. For each value of C, 30 data sets were generated in order to get 
stable statistics for < dR > and adjusted Rand score. 



26 




(a) Gaussian Data with Gaussian Noise 

1.2 



(b) Gaussian Data with Lognormal Noise 
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0.4 0.6 

<dR> 




0.4 0.6 

<dR> 



(c) Gaussian Data with Power-law Noise (d) Log-normal Data with Gaussian Noise 





(e) Log-normal Data with Lognormal Noise (f) Log-normal Data with Power-law Noise 



Figure 12. Adjusted Rand index for various data sets simulated via Gaussian and Log-normal 
distribution with p"" = 0.9, ~ and N^an — 25. This case refers to a cluster structure with eight 
clusters of size 5 elements, and one cluster of size 64 elements. For each value of C, 30 data sets were 
generated in order to get stable statistics for < dR > and adjusted Rand score. Figure (a) and (f) are 
the same of Fig.l in the paper and are here reported for completeness and for an easier comparison. 
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FlgUrG 13. Hierarchical clustering for uniform partition with a power law noise with exponent a — 1.1 and noise level 
c — 0.03 (a) Correlation template R* for a synthetic data structure with uniform sizes of 16 elements each, (b) Dendrogram 
associated with the DBHT hierarchical structure, (c) Dendrogram associated with the Average linkage, (d) Dendrogram 
associated with the Complete linkage. 

C Lymphoma data analysis 

C.l Emergence of GCB-like and ABC-like Patterns on PMFG 

Here, we report how the GCB-hke and ABC-Uke classification of DLBCL subtypes naturally emerges in 
the PMFG. This is shown in Fig. [17| where we can observe that ABC-like DLBCLs are dominant on the 
top of PMFG, and mainly occupy sample-cluster '7' and '9'. On the other hand, GCB-like DLBCLs are 
dominant on the center of PMFG, and mainly occupy sample-cluster '1', '5' and '7'. Among the sample- 
clusters associated with DLBCL, sample-cluster '1' and '5' are distinctively characterized by GCB-like 
DLBCL, sample cluster '9' is characterized by ABC-like DLBCL. Interestingly, sample cluster '1' and 
'5' indicate a further sub-classification of GCB-like DLBCL, and yet show superior survival rates than 
sample clusters associated ABC-like DLBCL, a more fatal subtype indicated by Alizadeh et al 2000 than 
GCB-like DLBCL (See Table 1 in the main paper). Furthermore, sample-cluster '7' is a mixture of these 
two subtypes, and yet shows much worse survival rates than sample-cluster '9' in which is present a 
much larger portion of ABC-like DLBCL (See Table 1 in the main paper). This clearly shows that the 
DBHT clustering indicates further meaningful subtypes of DLBCLs with respect to the GCB-/ABC-like 
classification of Alizadeh et al 2000. 

C.2 Analysis of significant gene-clusters for sample-clusters 

In order to look for significant gene-clusters which distinguish each sample-cluster, we have performed 
a series of statistical analysis on the gene-clusters of the data found by DBHT clustering. Specifically, 
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Figure 14. Adjusted Rand index for the comparison between the synthetic partition in FigllSIa) and the partitions retrieved 
by cutting the dendrograms from our DBHT clustering method at various numbers of clusters, (a) Comparison between the 
synthetic partition with the 4 large clusters and the partitions from DBHT, average linlcagc and complete linkage, (b) 
Comparison between the synthetic partition with the 8 medium clusters and the partitions from DBHT, average linkage and 
complete linkage, (c) Comparison between the synthetic partition with the 16 small clusters and the partitions from DBHT, 
average linkage and complete linkage. (d),(e),(f) Details of the upper figures showing the region where the DBHT has the 
maximum. The plots report average values over a set of the 30 trials, the error bars are the standard deviations. 



we have performed a combination of differential expression and enrichment analysis. Firstly, for a given 
sample-cluster, we have looked for a set of differentially expressed gene-profiles for a given cut-off p- 
value. Then we have calculated enrichment statistics for each gene-cluster by asking whether this cluster 
significantly enriches for the differentially expressed profiles. By varying the cut-off p-values, we have 
identified the most significant gene-cluster for the particular sample-cluster by choosing the gene-cluster 
that remains significantly enriched for the smallest cut-off. In order to identify differentially expressed 
profiles for each cut-off p-value, we have performed non-parametric Kruskal-Wallis one-way ANOVA test. 
The enrichment statistics has been evaluated by using the hypergeometric test with significance level of 



p-value 0.05, where the p-values were adjusted by Bonferroni correction. Fig. 18 reports the smallest 
cut-off p-values for each gene-cluster, for each sample-cluster. The list of labels for the most significant 
gene-clusters is shown in Table |3] Except for sample-cluster '2' and '6', each sample-cluster is assigned 
to a unique gene-cluster. For what concerns sample-cluster '2' this is most likely due to the small cluster 
size. Instead, we note that sample-cluster '6' corresponds to a collection of T Cell samples, and we suspect 
that the emergence of multiple significant gene-clusters is due to the broad spectrum of T cells in the 
physiology of lymphoma. 



C.3 Gene Ontology analysis on significant gene-clusters 

Among all significant gene-clusters, we have chosen a subset of gene-clusters which are associated to 
lymphoma malignancies, and we have performed Gene Ontology (GO) analysis on these gene-clusters in 
order to investigate associated biological processes. The analysis has been performed with significance 
level of p-value 0.05 on a plug-in software for Cytoscape, called BiNGO, and we applied Bonferroni 
correction. We have obtained a number of significant biological processes which are reported in Table [4] 
These biological processes indicate the underlying genetic mechanisms of which genes in the same gene- 
cluster share. For instance, gene-cluster '44' is associated to a large number of GO terms for cell cycles 
and cell cycle regulation. Indeed, this gene-cluster contains, for various phases, a key cell-cycle regulator 
CDKl whose over-expression pattern is a characteristic feature of DLBCL as discussed in the main paper. 
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On the other hand, none of the significant biological processes was captured by GO analysis for gene- 
cluster '102'. However, by no means, this cluster is un-significant for the sample-cluster. Indeed, as the 
enrichment analysis in Fig. 18 suggests, gene-cluster '102' remained enriched for very low p- value ~ 10~^, 
and it includes biologically significant genes for CLL such as IRFl as reported in the main paper. In 
Table [s] we report the full list of clones for the gene-cluster. 
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FlgUr© 16. Adjusted Rand index for the comparison between the synthetic partition in Fig |l5[ a) and the partitions retrieved 
by cutting the dendrogram from the DBHT clustering method at various number of clusters, (a) Oomparison between DBHT 
clustering and the synthetic partition with the 2 'large' clusters, (b) Comparison between DBHT clustering and the synthetic 
partition with the 5 'medium' clusters, (c) Comparison between DBHT clustering and the synthetic partition with the 10 'small' 
clusters. (d),(e),(f) Details of the upper figures showing the region where the adjusted Rand index from DBHT has the 
maximum. The plots (b), (c) and (d) report average values over a set of the 30 trials, the error bars are the standard deviations. 



Sample Cluster 


Gene Cluster 


1 


44 


2 


6,12,44,177 


3 


29 


4 


109 


5 


1 


6 


1,4,32,59,154 


7 


4 


8 


38 


9 


125 


10 


127 


11 


102 



Table 3. List of most significant gene-clusters for the sample-clusters. Sample clusters in bold italic 
font correspond to the clusters associated to lymphoma malignancies. 
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Figure 17. Visualization on the PMFG of the GCB-hke and ABC-like classifications as given by 
Alizadeh et al 2000. The labels inside the symbols correspond respectively to GCB-like DLBCL (G) 
and ABC-like DLBCL (A). The symbols are the same used to represent the sample clusters found by 
DBHT technique in Fig. 4 in the main paper . 
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Figure 18. Plot of cut-off p-valuc for Kruskal-Wallis one-way ANOVA test -vs- enriched gene-clusters. 
Circles represent the smallest cut-off p-value for individual gene-clusters. 
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Sample Cluster # 


GO ID 


i;orr p-v;iliie 


Gene Count 


GO description 


1 


22403 


.5.93E-20 


25/58 


cell cycle phase 


:Gene Cluster 44 


22402 


3.78E-18 


26/58 


cell cycle process 




279 


1.77E-16 


21/58 


M phase 




7040 


8.78E-15 


26/58 


cell cycle 




51301 


9.84E-11 


16/58 


cell division 




51721) 


1.12E-10 


18/58 


regulation of cell Cycle 




278 


1.19E-10 


17/.58 


mitotic cell cycle 




Ii99(i 


.5.99E-10 


27/.'i8 


organelle organization 




10043 


4.3(lE-08 


33/58 


cellular component organization 




280 


1.64E-07 


12/58 


nuclear division 




71)67 


1.64E-07 


12/58 


mitosis 




87 


2.31E-07 


12/.58 


M phase of mitotic cell cycle 




48285 


2.54E-07 


12/.58 


organelle fission 




Ii259 


1.88E-0(i 


15/58 


DNA metabohc process 




0974 


6.49E-06 


13/58 


response to DNA diunage stimulus 




51321 


l.llE-05 


8/58 






75 


1.5(lE-05 


8/58 


cell cycle checkpoint 




6281 


3.75E-05 


11/58 


DNA repair 




442fil) 


4.41E-05 


34/.58 


cellular macromolecule metabohc process 




48r,22 


9.05E-05 


25/58 


positive regulation of cellular process 




65009 


1.48E-04 


18/.58 


regulation of molecular function 




33rj54 


1.69E-04 


14/.58 


cellular response to stress 




5127() 


1.87E-04 


13/.58 


chromosome organization 




79 


2.05E-04 


6/58 


regulation of cychn-dependent protein kinase activity 




712(i 


2.42E-04 


7/58 


meiosis 




51327 


2.42E-04 


7/58 


M phase of melotic cell cycle 




5171(1 


2.92E-04 


17/58 


cellular response to stimulus 




5(1790 


.5.38E-04 


16/58 


regulation of catalytic activity 




48rjl8 


6.09E-04 


25/58 


positive regulation of biological process 




9(1304 


7.fi3E-04 


20/58 


nucleic acid metabohc process 




6139 


1.03E-03 


22/58 


nucleobase, nucleoside, nucleotide and nucleic acid metabolic process 




9987 


1.13E-03 


54/58 


ceUular process 




43171) 


1.5(lE-03 


34/58 


macromolecule metabolic process 




51341) 


1.54E-03 


6/58 


regulation of ligase activity 




7051 


2.17E-03 


5/58 


spindle organization 




44237 


2.65E-03 


38/58 


ceUular metabolic process 




65003 


3.28E-03 


13/58 


macromolecular complex assembly 




34641 


3.42E-03 


23/58 


cellular nitrogen compoimd metabohc process 




51329 


4.83E-03 


6/58 


interphase of mitotic cell cycle 




631(1 


.5.41E-03 


(i/58 


DNA recombination 




51325 


6.05E-03 


6/58 


interphase 




43933 


6.96E-03 


13/.58 


macromolecular complex subunit organization 




Ii26(i 


7.42E-03 


.3/58 


DNA hgation 




42127 


7.43E-03 


14/58 


regulation of cell proliferation 




48.519 


8.31E-03 


22/58 


negative regulation of biological process 




6807 


8.92E-03 


23/58 


nitrogen compoimd metabohc process 


4 


5(1851 


4.41E-02 


3/33 


antigen receptor-mediated signaling pathway 


: Gene Cluster 100 










5 


44261) 


3.1(iE-14 


107/206 


cellular macromolecule metabohc process 


: Gene Cluster 1 


43171) 


2.74E^11 


110/206 


macromolecule metabolic process 




44237 


4.72E-10 


123/206 


ceUular metabolic process 




43687 


4.4(lE-09 


52/206 


post-translational protein modification 




44238 


1.76E-08 


124/206 


primary metabolic process 




43412 


l.llE-07 


57/206 


macromolecule modification 




6464 


1.47E-07 


55/206 


protein modlflcation process 




44267 


3.12E-06 


65/206 


ceUular protein metabohc process 




8152 


4.17E-06 


128/206 


metabohc process 




5(1794 


8.38E-06 


131/206 


regulation of ceUular process 




(i468 


1.05E-05 


31/206 


protein amino acid phosphorylation 




9(1304 


2.33E-05 


49/20(i 


nucleic acid metabohc process 




1(1468 


2.74E-05 


77/20(i 


regulation of gene expression 




(i79(i 


4.94E-05 


37/206 


phosphate metabohc process 




6793 


4.94E-05 


37/206 


phosphorus metabohc process 




16310 


.5.55E-05 


33/206 


phosphorylation 




34641 


8.94E-05 


60/206 


ceUular nitrogen compound metabohc process 




6139 


1.23E-04 


54/206 


nucleobase, nucleoside, nucleotide and nucleic acid metabolic process 




31323 


1.34E-04 


89/206 


regulation of cellular metabohc process 




51171 


1.47E-04 


77/206 


regulation of nitrogen compound metabohc process 




1(1.556 


1.64E-04 


74/20(i 


regulation of macromolecule biosynthetic process 




16071 


1.S4E-04 


21/20(i 


mRNA metabolic process 




19219 


2.31E-04 


76/20(i 


regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolic process 




45449 


2.36E-04 


()9/206 


regulation of transcription 




6807 


2.73E-04 


61/20(i 


nitrogen compoimd metabolic process 




5(1789 


3.65E-04 


131/206 


regulation of biological process 




6(1255 


6.25E-04 


81/206 


regulation of macromolecule metabolic process 




7165 


6.6(lE-04 


54/206 


signal transduction 




19.538 


1.09E-03 


(i7/20(i 


protein metabohc process 




31326 


1.24E-03 


74/20(i 


regulation of ceUular biosynthetic process 




8(1091) 


1.32E-03 


83/20(1 


regulation of primary metabolic process 




19222 


1.35E-03 


89/206 


regulation of metabohc process 




9889 


1.71E-03 


74/206 


regulation of biosynthetic process 




l(i07l) 


2.21E-03 


34/20(i 


RNA metabohc process 




6357 


6.65E-03 


28/206 


regulation of transcription from RNA polymerase II promoter 




7049 


7.2(iE-03 


29/206 


ceU cycle 


7 


48102 


7.1(iE-03 


2/30 


autophagic cell death 


:Gene Cluster 4 










9 


6955 


.5.56E-09 


21/75 


immune response 


: Gene Cluster 125 




7.S2E-09 


25/75 


immune system process 




9611 


2.2(lE-05 


16/75 


response to wounding 




(i952 


2.22E-05 


17/75 


defense response 




(i95(l 


4.28E-05 


28/75 


response to stress 




23052 


.5.59E-05 


38/75 


signaling 




5(1896 


8.44E-05 


41/75 


response to stimulios 




(i954 


1.15E-04 


12/75 


inflammatory response 




Ii935 


3.37E-04 


9/75 


chemotaxis 




42331) 


3.37E-04 


9/75 


taxis 




4(1011 


.5.9(iE-04 


13/75 


locomotion 




23033 


1.55E-03 


28/75 


signaling pathway 




9(i07 


1.7:iE-03 
-'.■2 li:-0:} 


ID 7"i 


response to biotic stimulus 


. 1 

(CLL ( J1M:1 ■ 

: Gone CI i.-in' Id- 











Table 4. Over-represented GO terms for each of the significant gene-clusters of sample-clusters 1, 5, 7, 
9 (associated to DLBCL), 4 (associated to FL) and 11 (associated to CLL). 
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Clone name 

*LyGDI=Rho GDP-dissociation inhibitor 2=RH0 GDI 2: Clone=23 
*LyGDI=Rlio GDP-dissociation inliibitor 2=RH0 GDI 2; Clone=1240974 
*F'LI-l=ERGB=ots family transcription factor: ClonG=280882 
*FLl-l=ER(;B=ots family transcription factor: Clone=1354062 
(Arp2/3 protein complex snbnnit. p:i4-Arc (ARC34): Clone=1334980) 
(Unknown UG Hs. 28242 ESTs: Clonc=1303641) 
(Aconitasc— mitochondrial protein; Clone— 1353272) 
(B-actin, 421-689: Clone=136) 
(B-aetin,177-439; Clonc=137) 
(Rctinoblastoma-like 1 (pl07); Clone=249725) 
(B-actin, 657-993; Clone=145) 
*actin=cytoskeletaI gainma-actin; Clone=1240822 

*Similar to nuclear protein NIP45=potentiates NFAT-driven interleukin-4 transcription; Clone=512953 
actin— cytoskeletal gamma^actin; Clone=588637 
^Adenine nucleotide translocator 2; Clone=291660 
*Adenine nucleotide translocator 2; Clone=1241102 
*Calmodulin 1 (phosphorylase kinase, delta); Clone=549080 



Table 5. List of clones in gene-cluster 102, which corresponds to the most significant gene-cluster for 
sample-cluster 11 associated to CLL. 



